[7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (continued)#43209
Conversation
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
…ch stable ABI Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
…ed_add_rms_norm_static_fp8_quant) to torch stable ABI Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
…libtorch_stable Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com>
…herry-pick merge conflict Signed-off-by: Chris Leonard <chleonar@redhat.com>
There was a problem hiding this comment.
Code Review
This pull request migrates several CUDA kernels, including layernorm, quantization, and positional encoding, to the libtorch_stable directory. The refactoring replaces standard Torch and ATen dependencies with stable alternatives such as torch::stable::Tensor, STD_TORCH_CHECK, and stable device/stream utilities. The PR also introduces new dispatch macros in dispatch_utils.h. Feedback was provided regarding the VLLM_STABLE_DISPATCH_GROUP_SIZE macro, which lacks error handling for unsupported group sizes, potentially leading to silent failures.
| #define VLLM_STABLE_DISPATCH_GROUP_SIZE(group_size, const_group_size, ...) \ | ||
| if (group_size == 128) { \ | ||
| constexpr int const_group_size = 128; \ | ||
| __VA_ARGS__(); \ | ||
| } else if (group_size == 64) { \ | ||
| constexpr int const_group_size = 64; \ | ||
| __VA_ARGS__(); \ | ||
| } |
There was a problem hiding this comment.
The VLLM_STABLE_DISPATCH_GROUP_SIZE macro does not handle unsupported group_size values. This could lead to silent failures where no code is executed if the group_size is not 64 or 128. For improved robustness and consistency with other dispatch macros in this file (like VLLM_STABLE_DISPATCH_RANK234), an else block should be added to check for and handle unsupported values.
| #define VLLM_STABLE_DISPATCH_GROUP_SIZE(group_size, const_group_size, ...) \ | |
| if (group_size == 128) { \ | |
| constexpr int const_group_size = 128; \ | |
| __VA_ARGS__(); \ | |
| } else if (group_size == 64) { \ | |
| constexpr int const_group_size = 64; \ | |
| __VA_ARGS__(); \ | |
| } | |
| #define VLLM_STABLE_DISPATCH_GROUP_SIZE(group_size, const_group_size, ...) \ | |
| if (group_size == 128) { \ | |
| constexpr int const_group_size = 128; \ | |
| __VA_ARGS__(); \ | |
| } else if (group_size == 64) { \ | |
| constexpr int const_group_size = 64; \ | |
| __VA_ARGS__(); \ | |
| } else { \ | |
| STD_TORCH_CHECK(false, "Unsupported group_size, expected 64 or 128 but got: ", group_size); \ | |
| } |
| "rotary_embedding(Tensor positions, Tensor! query," | ||
| " Tensor!? key, int head_size," | ||
| " Tensor cos_sin_cache, bool is_neox, int " | ||
| "rope_dim_offset=0, bool inverse=False) -> ()"); |
There was a problem hiding this comment.
Merge conflicts location: addition of the "rope_dim_offset=0"
| int64_t head_size, | ||
| torch::Tensor& cos_sin_cache, // [max_position, rot_dim] | ||
| torch::stable::Tensor& cos_sin_cache, // [max_position, rot_dim] | ||
| bool is_neox, int64_t rope_dim_offset, bool inverse) { |
There was a problem hiding this comment.
Merge conflicts location: addition of the "rope_dim_offset"
| } | ||
| }); | ||
| }); | ||
| const torch::stable::accelerator::DeviceGuard device_guard( |
There was a problem hiding this comment.
Merge conflicts location: addition of the "rope_dim_offset"
| bool is_neox, torch::Tensor& position_ids, | ||
| int64_t forced_token_heads_per_warp); | ||
|
|
||
| void fused_deepseek_v4_qnorm_rope_kv_rope_quant_insert( |
There was a problem hiding this comment.
Merge conflicts location: addition of the "fused_deepseek_v4_qnorm_rope_kv_rope_quant_insert"
| "int num_heads_k, int num_heads_v, int head_dim, float eps, " | ||
| "Tensor q_weight, Tensor k_weight, Tensor cos_sin_cache, " | ||
| "bool is_neox, Tensor position_ids, " | ||
| "int forced_token_heads_per_warp=-1) -> ()"); |
There was a problem hiding this comment.
Merge conflicts location: addition of the "forced_token_heads_per_warp=-1"
| #include <ATen/cuda/CUDAContext.h> | ||
| #include <c10/cuda/CUDAGuard.h> | ||
| #include "torch_utils.h" | ||
|
|
There was a problem hiding this comment.
Merge conflicts location: deletion of #include <ATen/cuda/CUDAContext.h>, and
#include "async_util.cuh", two headers that were not there before
| torch::Tensor& cos_sin_cache, // Cos/sin cache [max_position, head_dim] | ||
| bool is_neox, // Whether RoPE is applied in Neox style | ||
| torch::Tensor& position_ids, // Position IDs for RoPE [num_tokens] | ||
| int64_t forced_token_heads_per_warp // -1 = auto-select, >0 = forced value |
There was a problem hiding this comment.
Merge conflicts location: addition of the "forced_token_heads_per_warp"
Is it possible that you also migrate these CPU kernels to libtorch stable ABI? |
| const torch::stable::accelerator::DeviceGuard device_guard( | ||
| qkv.get_device_index()); | ||
| auto stream = get_current_cuda_stream(qkv.get_device_index()); | ||
|
|
There was a problem hiding this comment.
Merge conflicts location: addition of device_id and stream.
| std::optional<torch::Tensor> residual, | ||
| int64_t group_size, bool is_scale_transposed); | ||
|
|
||
| void silu_and_mul_per_block_quant(torch::Tensor& out, |
There was a problem hiding this comment.
Merge conflicts location: deletion of #ifdef USE_ROCM
| #include "libtorch_stable/quantization/vectorization.cuh" | ||
| // TODO(luka/varun):refactor common.cuh to use this file instead | ||
| #include "../w8a8/fp8/common.cuh" | ||
| #include "../../../quantization/w8a8/fp8/common.cuh" |
There was a problem hiding this comment.
Merge conflicts location: #include "quantization/w8a8/fp8/common.cuh" was updated to #include "../w8a8/fp8/common.cuh" (both replaced by #include "../../../quantization/w8a8/fp8/common.cuh")
@Harry-Chen, our plan is to eventually move CPU to stable ABI as well. But first, we want to migrate the CUDA build completely onto to the libtorch stable ABI before we begin with the CPU migration. |
Makes sense. Then could you add some comments to the original header file explaining this? |
|
@Harry-Chen Also, there was one issue during the merge where I missed an update from TORCH_CHECK to STD_TORCH_CHECK. I have it fixed locally but I doing a quick check to make sure there wasn't another miss before I push the update. I should be able to re-push in a few minutes and will let you know when it's ready! |
| @@ -691,7 +690,7 @@ void launchFusedQKNormRopeNTokenHeads( | |||
| }); \ | |||
| break; \ | |||
| default: \ | |||
| TORCH_CHECK(false, "Unsupported head dimension: ", head_dim); \ | |||
There was a problem hiding this comment.
This block wasn't here before so I had to update TORCH_CHECK.
|
@Harry-Chen, I added the comments you mentioned. If they look good to you, can you add the 'ready' label? |
janeyx99
left a comment
There was a problem hiding this comment.
Confirmed that this is the same as the PR before, with proper fixes to merge conflicts. LGTM
| VLLM_STABLE_DISPATCH_CASE_QUANT_TYPES(__VA_ARGS__)) | ||
|
|
||
| // Group size dispatch (pure C++ if/else, no ATen dependency) | ||
| #define VLLM_STABLE_DISPATCH_GROUP_SIZE(group_size, const_group_size, ...) \ |
There was a problem hiding this comment.
Outside scope of this PR, but technically this is exactly the same as #define VLLM_DISPATCH_GROUP_SIZE so we could use the stable version everywhere now
|
TIL: torch-abi-audit. Very nice tool! |
…l_block_quant.cu of ../w8a8/fp8/common.cuh (included already through libtorch_stable/quantization/fused_kernels/quant_conversions.cuh) was causing redefinition errors in ROCm build due to hipify process. Removing redundant inclusion. Signed-off-by: Chris Leonard <chleonar@redhat.com>
Head branch was pushed to by a user without write access
Yes it is! We have @rgommers to thank for that! |
|
@Harry-Chen I don't think the current errors are due to this PR. Should I try to rebase onto main and try again? |
|
Also, @Harry-Chen, are you able to give me buildkite access for vllm? This would be helpful in being able to do more with the CI myself, like rerunning failed jobs. |
|
Thanks for the ping, glad you're finding |
Sure. And if we are stuck on flaky CI tests, we can also ask for a force merge.
CC @khluu on this |
…ontinued) (vllm-project#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
…ontinued) (vllm-project#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
* [XPU] add gptq(int4) support (#37844) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * [UX] Add a persistent cache for FlashInfer autotuning (#42537) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#42766) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` [2/N] (#43039) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * add cutedsl dsv4 indexer fp8 kernel (#42899) Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai> * [Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B (#42289) Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: gemini-code-assist <noreply@google.com> * [ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use keyword arguments for shard_id and expert_id in weight_loade… (#42671) Signed-off-by: junyanxu <junyanxu5513@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Docs] Add SVG images for pooling models. (#42626) Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> * [XPU] Use custom op collective behavior (#41354) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Misc] Aligning tokwise pooler heads for consistency (#43041) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> * [Docs] Reorganize online serving docs. (#41907) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Frontend] Consolidate beam search by BeamSearchMixin. (#42946) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> * [Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] (#43073) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [bug] AsyncScheduler drops first post-resume token after pause_generation + clear_cache (#42117) Signed-off-by: hao-aaron <ahao@anyscale.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KVConnector][DSV4] HMA support for Mooncake store connector (#42828) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Misc][MM] Remove redundant code in CLIPAttention (#43046) Signed-off-by: shen-shanshan <467638484@qq.com> * [CI] Add MTP + PD disagg test for Qwen3.5 (#42677) Signed-off-by: ZhanqiuHu <zhu@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [Bugfix] Fix top logprobs token placeholders in `/inference/v1/generate` (#42887) Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * [Perf][4/n] Eliminate various GPU<->CPU syncs (#42347) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [XPU] update xpu graph usage (#43043) Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com> * [Model] Openvla support (#42654) Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> * [Refactor] Extract extract_types_from_schema utility from Minimax M2 tool parser (#43025) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Misc] add humming to dependencies (#42540) Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> * [feat] Add FP8 per-tensor Q scale support to Triton attention backend (#42080) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * [Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [Bugfix][MoE] FlashInfer one-sided: workspace union across heterogeneous layers (#42976) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> * [CI failure] Temporarily disable using persistent cache for flashinfer autotune (#43119) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Move language models tests (hybrid) back to L4 (#43129) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Model] Support post-norm architecture for EAGLE-3 supeculators (#42764) Signed-off-by: Doğaç Eldenk <dogacel@gmail.com> * Fix error in Dynamic NTK scaling (#41277) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Max de Bayser <maxdebayser@gmail.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> * [CPU][DOC] Fix installation commands for Arm CPUs (#43115) Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com> * [bug] fix WeightTransferConfig.backend to allow for all strings (#43121) Signed-off-by: ahao-anyscale <ahao@anyscale.com> * [MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Cohere] Enable Cohere MoE (#43143) Signed-off-by: Terrencezzj <terrence@cohere.ai> * [Perf][Bugfix] Update dflash aux layer indexing (#40727) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> * add enqueue all option to throughput benchmark (#42975) Signed-off-by: Philip Maybank <pmaybank@amd.com> Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Perf] Avoid forward scan for async output placeholders (#42938) * [CI] Add DSV4-Flash to gsm8k moe-refactor/config-b200.txt (#42111) Signed-off-by: mgoin <mgoin64@gmail.com> * [KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to secondary tiers (#43076) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> * [ci] Revert model executor test back to L4 (#43188) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Docs][PD][NIXL] Lease extension mechanism for blocks on P (#43099) Signed-off-by: NickLucche <nlucches@redhat.com> * [Docs][PD][NIXL] Bidirectional kv-cache transfer (#43097) Signed-off-by: NickLucche <nlucches@redhat.com> * [6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * Enable mermaid diagrams in the docs (#43192) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [GDN] Enable FI Blackwell GDN prefill kernel (#40717) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> * [XPU][CI] Add 2 server model test files in Intel GPU CI (#42499) Signed-off-by: zengxian <xiangdong.zeng@intel.com> * [Frontend] Forward X-data-parallel-rank header on /inference/v1/generate (#42330) Signed-off-by: hallerite <git@hallerite.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Doc] Sync CLI guide with actual help modes and launch subcommand (#40326) Signed-off-by: Rui Wang <raygorous@gmail.com> Co-authored-by: Rui Wang <raygorous@gmail.com> * [Feature] Support manually enabling the cumem allocator (#33648) Signed-off-by: Kebe <mail@kebe7jun.com> * [Spec Decode] Support non-MTP speculation for NemotronH (#43130) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> * Remove additional dead code as a follow-up to #42889 (#43144) Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com> * [Bug][Structured Outputs] Fix bug that leads to unconstrained generations with structural tags (#42452) Signed-off-by: rishitdholakia13 <rishit+github@cohere.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Bugfix] Use enable_sm120_family for per-tensor FP8 CUTLASS kernels on SM12.1 (#41215) Signed-off-by: j9smith <j.smith9103@outlook.com> Signed-off-by: Joel Smith <j.smith9103@outlook.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Bugfix] Use shared coerce_to_schema_type in DeepSeekV32 tool parser (#43019) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [MISC] Fix symm_mem cap-equal gate; log AR backend selection (#42993) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> * [R3] Add routed experts to openai entrypoint (#38939) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [CI] Lower granite-4.0-h-tiny gsm8k threshold for Hybrid SSM NixlConnector PD accuracy tests (4 GPUs) (#43186) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: NickLucche <nlucches@redhat.com> * Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 (#40082) Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * [Perf] Optimize `CutlassFP8ScaledMMLinearKernel` when padding needed by pre-weight processing, 13.5% TTFT improvement (#42651) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> * [Bugfix][CI] Add missing import of pad_nvfp4_activation_for_cutlass in flashinfer (#43237) Signed-off-by: sfeng33 <4florafeng@gmail.com> * Add dllehr-amd to CODEOWNERS and committers list (#42772) Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com> * [Perf][gpt-oss] Downgrade triton_kernels to v3.5.1 (#43135) Signed-off-by: mgoin <mgoin64@gmail.com> * [Misc] downgrade nvidia-cutlass-dsl to 4.5.0 (#43230) Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> * [ROCm] Add QuickReduce min-size override and codec threshold (#41675) Signed-off-by: <> * [CI] Add composed-schema regression tests for DeepSeek V3.2/V4 parsers (#43255) Signed-off-by: Ace Eldeib <aeldeib@coreweave.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> * [Model Runner V2] Fix lora `Triton Error [CUDA]: device-side assert triggered` (#43139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * update GPU json file based on h200 recipes (#43262) Signed-off-by: louie-tsai <louie.tsai@intel.com> * [Minor] Bigger overlap for FI AR (#43103) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 (#36329) Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Perf][Gemma4] Batch vision encoder calls for image and video processing (#43169) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [CI] Fix "test_vit_cudagraph_[image|video][step3_vl]" failure (#43082) Signed-off-by: haosdent <haosdent@gmail.com> * [Frontend] Normalize reasoning_content to reasoning for client compatibility (#42664) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Refactor] Use shared coerce_to_schema_type in Seed-OSS tool parser (#43140) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [ToolParser][Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in Qwen3CoderToolParser (#37831) (#38973) Signed-off-by: AAISSJ <maze0717@g.skku.edu> Signed-off-by: <> Signed-off-by: sejung-son <sejung.son@nhn.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local> Co-authored-by: sejung-son <sejung.son@nhn.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * [Frontend][RFC] Rust front-end integration (#40848) Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] Warn when renderer_num_workers has no effect on offline LLM (#42905) Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> * [Benchmark] Add num-warmup to vllm bench throughput (#43245) Signed-off-by: Yifan Zong <yzong@redhat.com> * [Bugfix] Fix glm4_moe_tool_parser._is_string_type for /v1/responses FunctionTool format (#39601) Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * [CI] De-flake test_models for bigscience/bloom-560m (#43197) Signed-off-by: haosdent <haosdent@gmail.com> * [XPU] add setuptools-rust for xpu dependency (#43287) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * Update KDA chunk prefill decay to use exp2 semantics (#43195) Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> * Fix FlashInfer TRTLLM NvFP4 monolithic MoE routing (#43223) Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> * [Test] Replace zephyr-7b-beta (7B) with SmolLM2-135M in tokenization test (#43085) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bug] Fix ci issue `assert output_size is not None` AssertionError (#43261) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> * [CI] Pin protoc binary in rust-build stages (#43292) Signed-off-by: haosdent <haosdent@gmail.com> * [XPU][CI]Fix Docker image pull-to-run race in Intel GPU CI (#43266) Signed-off-by: zengxian <xiangdong.zeng@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943) Signed-off-by: velonica0 <like@mail.nankai.edu.cn> Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> * [Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing (#40172) Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [CI] Fix CPU tests failing on `tl.exp2` import (#43311) Signed-off-by: haosdent <haosdent@gmail.com> * [Bugfix] Add early validation to reject incompatible runner types for embedding models (#43079) Signed-off-by: anish <anishesg@users.noreply.github.com> Signed-off-by: Your Name <ak8686@princeton.edu> Signed-off-by: anish <145943060+anishesg@users.noreply.github.com> Co-authored-by: anish <anishesg@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Deprecation] Mark env vars covered by --moe-backend / --linear-backend (#43148) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> * [Perf] `zeros` -> `empty` to remove additional fill (#42988) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Core] Add native ModelExpress load format (#43105) Signed-off-by: Zheng Luo <zheluo@nvidia.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * Disable build isolation to bypass CUDA related deps for vllm-tpu (#43038) Signed-off-by: Ylang Tsou <ylangt@google.com> Co-authored-by: Ylang Tsou <ylangt@google.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> * [Frontend] Rework fastokens integration (#43168) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Feature] Add `--cpu-distributed-timeout-seconds` CLI Option for CPU Process Group Timeout (#42968) Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: zWaNg3 <389750525@qq.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [BugFix] Use correct logprobs for `logprob_token_ids` (#43125) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba (#41873) Signed-off-by: Lanze Liu <lanzetech@gmail.com> * [Rust Frontend] Move code from `vllm-frontend-rs` (#43283) Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Eric Curtin <eric.curtin@docker.com> Signed-off-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Signed-off-by: Will.hou <1205157517@qq.com> Signed-off-by: Will.hou <willamhou@ceresman.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Eric Curtin <eric.curtin@docker.com> Co-authored-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Co-authored-by: Will.hou <1205157517@qq.com> Co-authored-by: Will.hou <willamhou@ceresman.com> Please see https://github.com/Inferact/vllm-frontend-rs for full original commit history. * [CI] Fix dockerfile dependency graph failure for pre-commit (#43378) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Bugfix] Fix DSV4 Base model swiglu limit issue in FP8 path (#42855) Signed-off-by: Chengze Fan <chengze@meta.com> Signed-off-by: Chengze Fan <fancz2002@gmail.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> * [ROCm] Add XGMI backend for MoRI Connector (#41753) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [ROCm][CI] add warmup to mem_util test before measurement (#43236) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * [Frontend] Add truncation side to OpenAI endpoints (#43260) Signed-off-by: Rui Zhang <rza21.bc@gmail.com> Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net> Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net> * [Frontend] DP Supervisor (#40841) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Robert Shaw <robertgshaw2@gmail.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Make CuMemAllocator free callback stream-aware (#43020) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude <noreply@anthropic.com> * [XPU] Enable multiple key kernels for sparse attention (#37888) Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com> Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CI] De-flake renderers/test_hf.py::test_resolve_content_format_fallbacks[Qwen/Qwen-VL-string] (#43064) Signed-off-by: haosdent <haosdent@gmail.com> * [Model] Use `AutoWeightsLoader` for Voyage (#42972) Signed-off-by: Furkan Fidan <dev@yufufi.com> * [Model] Fix MiniCPM-V 4.6 vit_merger qkv weight loading (#43213) Signed-off-by: tc-mb <tianchi_cai@icloud.com> * [CI] Fix test_lora_with_spec_decode on V2 model runner (#43314) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [CI] Fix "test_awq_load[gemma4-moe-*]" failure (#43296) Signed-off-by: haosdent <haosdent@gmail.com> * Correcting the mock classes for MM GC tests (#43321) Signed-off-by: Weida Hong <wdhongtw@google.com> * [BugFix] Fix setuptools-rust dep in requirements files (#43377) Signed-off-by: Nick Hill <nickhill123@gmail.com> * Fix the docker build failure in tpu-inference (#43360) Signed-off-by: mrjunwan-lang <mrjunwan@google.com> * [Docs] Note image preprocessing difference between qwen_vl_utils and vllm. (#43393) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [CPU] Experimentally enable Triton and MRV2 (#43225) Signed-off-by: jiang1.li <jiang1.li@intel.com> * [Attention] Mamba attention module refactor (#41126) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> * [XPU]feat: add XPU fallback for MoE topk routing and MXFP4 backend (#42951) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Misc] Replace assert with proper exceptions for security and validation in pooling (#43286) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Clear P0 mm sender cache on sleep/pause to fix mm_hash desync (#43001) Signed-off-by: Tobias Wasner <wasnertobias@gmail.com> * [BugFix] wire make_empty_intermediate_tensors on AyaVision and Voxtral (#43118) Signed-off-by: Keyi Li <likey6688@gmail.com> Co-authored-by: Keyi Li <likey6688@gmail.com> * [LoRA] Reduce memory of 2D weights when EP is set (#42737) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [EPLB] Change default EPLB communicator (#43110) Signed-off-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> * [CI] Fix AMD docker build tests (#43329) Signed-off-by: haosdent <haosdent@gmail.com> * Add NVFP4 MOE support for Deepseek V4. (#42209) Signed-off-by: Shiyang Chen <shiychen@nvidia.com> * [Multimodal] Simplify ViT CUDA graph interfaces (#41234) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Rust Frontend] [Refactor] Extract a newtype for utility call ID (#43405) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] Source num_qo_heads from Attention layers in Flashinfer/Triton metadata builders (#42650) Signed-off-by: zhanda <zhandazhu@gmail.com> Co-authored-by: Shang Wang <shangw@nvidia.com> * [KV Connector] MooncakeStore: don't co-queue save with load to avoid double delayed-free (#43371) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Refactor] Extract DeepSeek V4 sparse MLA impl into model folder (#43149) * [Frontend] Simplify AuthenticationMiddleware path extraction (#43426) Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [RFC][EPLB][#32028] Remove dead torch.accelerator.synchronize() from sync path (#40733) Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Co-authored-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> * [Bugfix] Detect wrong libcute_dsl_runtime.so variant in FlashInfer GDN (#43427) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> * [Bugfix] Clear error message for FP8 torchao quantization on unsupported GPUs (#36854) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * mhc_post - remove sts & add vectorized copies (#43437) Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai> * [Quantization][ModelOpt] W4A16 NVFP4 fused MoE + mixed-precision dispatch (#42566) Signed-off-by: Juhi Mittal <juhim@nvidia.com> * [Model Runner V2] Support sharing kv cache layers (#35045) Signed-off-by: Nick Hill <nickhill123@gmail.com> * DSv4 fused Q-norm kernel grid refactor (#42353) * [Perf] Optimize hidden state extraction logic (#37374) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [XPU]fix: add XPU platform guards to DeepSeek-V4 ops (#42950) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * elastic_ep: stage/commit MoE quant method on reconfigure (#40881) Signed-off-by: Itay Alroy <ialroy@nvidia.com> * [Attention] Add head_dim=512 support for FlashInfer trtllm attention backend (#38822) * Add `model` to `WeightTransferEngine.__init__` (#42922) Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSV4] More multi-stream enablement for c4a (#42925) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Stabilize runner teardown between sampler tests (#43023) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm][CI] Stabilize Granite tool-use and test URL construction (#43017) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix] Auto-raise max_num_batched_tokens for prefix-LM multimodal models (#43051) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com> * [ROCm][CI] Fix ROCm LoRA Transformers fallback with full CUDA graphs (#41577) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [XPU]feat: enable FP8 block-scaled quantization on XPU (#42952) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [XPU] reudce host overhead of XPU MOE (#42915) Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (continued) (#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Misc] Added missing return type annotations to improve mypy and IDE tooling (#43383) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> * [Bugfix] Fix native Triton top-k/top-p kernel assumes contiguous logi… (#42739) Signed-off-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> * [ModelOpt] Support Qwen3.5/3.6 VLM quantized prefix mapping (#42546) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * Keep scheduler alive for delayed KV connector frees (#43433) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix(eagle3): read norm_before_fc from eagle_config for NVIDIA checkpoint (#42143) Signed-off-by: FERRARIZHENG <popkart06@gmail.com> * [Kernel] Batch invariant NVFP4 linear using cutlass (#39912) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Remove benchmarks test group and shard long test groups (#41669) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix][Frontend] Fix input_audio parsing when uuid is present (#43414) Signed-off-by: ffggs <314137448@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [MM] Enable FlashInfer metadata support for Qwen2.5-VL vision attention (#42787) Signed-off-by: Hua Huang <huah@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Docs] Fix stale version number in token_embed.md (#43488) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [Docs] Fix stale version number in token_classify.md (#43489) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [MoE] Migrate W4A8 CT to oracle kernel setup (#42680) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [Mooncake] Add metrics for MooncakeStoreConnector operations (#43392) * [ROCm][Critical] Fix the GDN import bug (#43486) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Revert "[Misc] add humming to dependencies" (#43492) * [Bugfix] Fix reasoning dropped on streaming boundary deltas (#42691) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Model Runner v2] Force v1 runner for tests (#43233) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [KV Connector] Keep MooncakeStore full hits block-aligned (#43494) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [kv_offload]: Add DSv4 support (#43142) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [ROCm][CI] Stabilize 400 error return code for invalid schema inputs (#43016) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] [DSv4] [Perf] Support DeepSeek v4 MTP (#43385) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Tuning script and configs for Triton Mamba SSU kernel (#43083) Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> * File system secondary tier implemented in python (#41735) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> * [Kernel] Add mhc_pre_big_fuse_with_norm_tilelang (#43474) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * fix: MoE model using shared routed experts crashes on AMD GPUs (#42373) Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> * [Docs] Reorganize offline inference docs. (#43552) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docker] Non-root support for vllm-openai; add opt-in vllm-openai-nonroot target (#40275) Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Co-authored-by: Claude <noreply@anthropic.com> * [Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend (#42296) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Doc] Add section on escalating stalled contributions (#43568) Signed-off-by: esmeetu <jasonailu87@gmail.com> * Reduce memory usage for granite_speech. (#42933) Signed-off-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Connector] Handle Mooncake finish after preemption (#43281) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> * [Misc] Print accuracy value for PD tests even on success (#43583) Signed-off-by: NickLucche <nlucches@redhat.com> * [Kernel] Remove NormGateLinear (#43554) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [XPU] Ensure RNG offset alignment with PyTorch requirements in XPU sampler (#43028) Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [LoRA] Add one shot triton kernel For MoE LoRA (#42290) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [DeepSeek V4] Move MegaMoE input prep kernel to nvidia/ops (#43632) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [KV Connector][Bugfix] MooncakeStore: don't double-apply Eagle prune in load_mask (#43516) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Connector] Propagate MooncakeStore load failures (#42788) Signed-off-by: Dao Le <Dao007forever@gmail.com> * [Bugfix] fix device mismatch in MiniCPM-o-4_5 resampler (#43194) Signed-off-by: Yan Ma <yan.ma@intel.com> * [Frontend] Split the offline inference APIs and utils. (#43553) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Bugfix][Model] Fix GPT2ForSequenceClassification sub-module prefix (#43579) Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [GDN] GDN Prefill kernel for SM100 (#43273) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * [CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032) Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> * Upgrade tpu-inference to v0.20.0 (#43394) * Add CuTe DSL sparse compressor support (#43584) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [chores][log] change registry log from `warning` to `debug` (#43045) Signed-off-by: Hank <hcc.mayday@gmail.com> * [Bugfix] Apply fc_norm in Eagle3DeepseekV2 combine_hidden_states (#43482) Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Transfer] Enable HMA by default for connectors that support it (#41847) Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> * [Misc][Refactor][ROCm] Convert MoRI-related envvars to extra config args (#43303) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [Misc] Support interleaved custom image benchmark datasets (#43636) Signed-off-by: ThibaultCastells <thib.castells@icloud.com> * [Reasoning] [Bugfix] Reject invalid thinking_token_budget values (#43402) Signed-off-by: linzm1007 <linzm1007@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model] Use AutoWeightsLoader for InternLM2 (#38278) Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [XPU] Fix fused MoE LoRA kernel crash on XPU by using platform-agnos num_compute_units (#43646) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> * Fix CuPy runtime deps and restore humming (#43530) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Docs][ROCm] MoRI-IO Connector Usage Guide (#43603) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ROCm][CI] Extend ROCm quick reduce coverage (#40990) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162) * [MoE Refactor] Migrate ModelOptMxFp8FusedMoE to oracle (#42768) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [MoE Refactor] W4a8 int8 oracle (#42789) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [ROCm] Remove MegaMoE integration in deepseek v4 (#43629) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Add LM head quantization support for ModelOpt (#42124) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * [Doc] Add line limit to AGENTS.md (#43635) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> * [DSv4] Drop _get_compressed_kv_buffer in DeepseekCompressor (#43690) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CI] Soft-fail AMD entrypoints mirror tests (#43709) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Kernel] Porting fuse_minimax_qk_norm to manual fusion (#43410) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [KV Connector] MooncakeStore: drop dead discard_partial_chunks parameter (#43627) Signed-off-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup (#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Add arm64 ci image (#41303) Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bugfix] Split attention groups by num_heads_q for spec-decode drafts (#43543) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Rust Frontend] Add reasoning/tool parser & renderer roundtrip tests (#43582) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [ROCm][CI] Fix ROCm multimodal Qwen2.5-VL activation compile and Phi4MM ragged image mask handling (#43647) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Perf] Optimize Fp8BlockScaledMMLinearKernel input_scale tensor using new_empty() (#43677) Signed-off-by: Xin Yang <xyangx@amazon.com> * [Attention] Make FlexAttention and FlashAttention use num-blocks first layouts (#42095) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [MLA][Attention] Add OOT MLA prefill backend registration mechanism (#43325) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> * [Deprecation] Deprecate functions as scheduled for v0.21.0 (#43358) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSv4] Refactor compressor & Fix ROCm compatibility (#43710) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Fix test_aot_compile for torch 2.12 (#43695) Signed-off-by: Angela Yi <yiangela7@gmail.com> * [KVConnector][Mooncake] Wire reset_cache cascade end-to-end (#42694) Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [ROCm][Perf] Expose AITER MoE sorting dispatch policy via env var (#39177) Signed-off-by: nholmber <nholmber@users.noreply.github.com> * [MRV2][BugFix] Fix KV connector handling in spec decode case (#43719) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Frontend] Add MiniCPM5 XML tool call parser (#43175) Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][GPT-OSS] Avoid repeated compile-time `cos_sin_cache.to(bf16)` casts in rotary path (#42833) Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> * [Doc] Add Ascend NPU tab to the quickstart installation guide (#43550) Signed-off-by: Aditya Singh <adisin650@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Rust Frontend] Align tool parser fallback behavior between streaming & non-streaming paths (#43662) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Docs] Fix MLA prefill backend default docs (#43697) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Kernel] Enable TritonW4A16LinearKernel as CUDA fallback for non-Marlin-aligned W4A16 shapes (#43731) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (#43401) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [misc] Bump cutedsl version to 4.5.2 (#43745) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [BugFix] HFValidationError with cloud storage URIs when HF_HUB_OFFLINE=1 (#39155) Signed-off-by: Injae Ryou <injaeryou@gmail.com> * [Docs] Fix the duplicate doc icon issue (#43546) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * Fix early CUDA init (#43791) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_MORI_INTERNODE_KERNEL (#41751) Signed-off-by: jatseng-ai <jatseng@amd.com> * [8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Quantization] Fix Humming RoutedExperts import (#43540) Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [CI] build-rocm-wheels.yml: reduce MAX_JOBS to prevent OOM Signed-off-by: <callumm@amd.com> --------- Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: george <george@inferact.ai> Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Signed-off-by: junyanxu <junyanxu5513@gmail.com> Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: ZhanqiuHu <zhu@redhat.com> Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com> Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Signed-off-by: Dao Le <Dao007forever@gmail.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Signed-off-by: Doğaç Eldenk <dogacel@gmail.com> Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Max de Bayser <maxdebayser@gmail.com> Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com> Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Terrencezzj <terrence@cohere.ai> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Philip Maybank <pmaybank@amd.com> Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> Signed-off-by: zengxian <xiangdong.zeng@intel.com> Signed-off-by: hallerite <git@hallerite.com> Signed-off-by: Rui Wang <raygorous@gmail.com> Signed-off-by: Kebe <mail@kebe7jun.com> Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com> Signed-off-by: rishitdholakia13 <rishit+github@cohere.com> Signed-off-by: j9smith <j.smith9103@outlook.com> Signed-off-by: Joel Smith <j.smith9103@outlook.com> Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com> Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: <> Signed-off-by: Ace Eldeib <aeldeib@coreweave.com> Signed-off-by: louie-tsai <louie.tsai@intel.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Signed-off-by: AAISSJ <maze0717@g.skku.edu> Signed-off-by: sejung-son <sejung.son@nhn.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Signed-off-by: velonica0 <like@mail.nankai.edu.cn> Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com> Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com> Signed-off-by: anish <anishesg@users.noreply.github.com> Signed-off-by: Your Name <ak8686@princeton.edu> Signed-off-by: anish <145943060+anishesg@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Zheng Luo <zheluo@nvidia.com> Signed-off-by: Ylang Tsou <ylangt@google.com> Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: zWaNg3 <389750525@qq.com> Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Chengze Fan <chengze@meta.com> Signed-off-by: Chengze Fan <fancz2002@gmail.com> Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Divakar Verma <divakar.verma@amd.com> Signed-off-by: Rui Zhang <rza21.bc@gmail.com> Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net> Signed-off-by: Robert Shaw <robertgshaw2@gmail.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: zixi-qi <zixi@inferact.ai> Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com> Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com> Signed-off-by: Furkan Fidan <dev@yufufi.com> Signed-off-by: tc-mb <tianchi_cai@icloud.com> Signed-off-by: Weida Hong <wdhongtw@google.com> Signed-off-by: mrjunwan-lang <mrjunwan@google.com> Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Ma Jian <jian1.ma@intel.com> Signed-off-by: Tobias Wasner <wasnertobias@gmail.com> Signed-off-by: Keyi Li <likey6688@gmail.com> Signed-off-by: Markov Ilya <markovilya19@gmail.com> Signed-off-by: Shiyang Chen <shiychen@nvidia.com> Signed-off-by: zhanda <zhandazhu@gmail.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com> Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Signed-off-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: FERRARIZHENG <popkart06@gmail.com> Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: ffggs <314137448@qq.com> Signed-off-by: Hua Huang <huah@nvidia.com> Signed-off-by: holegots <ikun3.1415927@gmail.com> Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Signed-off-by: esmeetu <jasonailu87@gmail.com> Signed-off-by: Yihuki <wangbovbvb@gmail.com> Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Yan Ma <yan.ma@intel.com> Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> Signed-off-by: Hank <hcc.mayday@gmail.com> Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> Signed-off-by: ThibaultCastells <thib.castells@icloud.com> Signed-off-by: linzm1007 <linzm1007@126.com> Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Signed-off-by: Bill Nell <bnell@redhat.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Kevin Luu <kevin@inferact.ai> Signed-off-by: Zhewen Li <zhewen@inferact.ai> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Xin Yang <xyangx@amazon.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Angela Yi <yiangela7@gmail.com> Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> Signed-off-by: Aditya Singh <adisin650@gmail.com> Signed-off-by: Injae Ryou <injaeryou@gmail.com> Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> Signed-off-by: jatseng-ai <jatseng@amd.com> Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Signed-off-by: <callumm@amd.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: gnovack <gnovack@amazon.com> Co-authored-by: george <george@inferact.ai> Co-authored-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: gemini-code-assist <noreply@google.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: Junyan Xu <junyanxu5513@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Gracie Guo (UX) <114208705+gracie-guo@users.noreply.github.com> Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Aaron Hao <ahao@anyscale.com> Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com> Co-authored-by: Sage <80211083+sagearc@users.noreply.github.com> Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com> Co-authored-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> Co-authored-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Co-authored-by: Dao007forever <dao007forever@gmail.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: Doğaç Eldenk <dogacel@gmail.com> Co-authored-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com> Co-authored-by: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Co-authored-by: Izik Golan <47969623+izikgo@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Artem Perevedentsev <aperevedents@nvidia.com> Co-authored-by: xiangdong <40376367+zxd1997066@users.noreply.github.com> Co-authored-by: hallerite <git@hallerite.com> Co-authored-by: Ray Wang <roguerui6@gmail.com> Co-authored-by: Rui Wang <raygorous@gmail.com> Co-authored-by: Kebe <mail@kebe7jun.com> Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com> Co-authored-by: rishitdholakia13 <123388671+rishitdholakia13@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Joel Smith <j.smith9103@outlook.com> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: haosdent <haosdent@gmail.com> Co-authored-by: meena-at-work <80416898+meena-at-work@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> Co-authored-by: akii96 <aakif.nawaz@amd.com> Co-authored-by: Ace Eldeib <alexeldeib@gmail.com> Co-authored-by: Louie Tsai <louie.tsai@intel.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: sonusflow <git@sonusflow.pl> Co-authored-by: Luciano Martins <22145370+lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: 손세정 <maze0717@g.skku.edu> Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local> Co-authored-by: sejung-son <sejung.son@nhn.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> Co-authored-by: yzong-rh <yzong@redhat.com> Co-authored-by: Yiyang "Ian" Liu <yiyangliu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: zexplorerhj <zhjoneson@163.com> Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: velonica0 <47554626+velonica0@users.noreply.github.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Francesco Fusco <ffu@zurich.ibm.com> Co-authored-by: anish <145943060+anishesg@users.noreply.github.com> Co-authored-by: anish <anishesg@users.noreply.github.com> Co-authored-by: Zheng Luo <zheluo@nvidia.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: ylangtsou <149562838+ylangtsou@users.noreply.github.com> Co-authored-by: Ylang Tsou <ylangt@google.com> Co-authored-by: fangyuchu <fangyuchu@qq.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: Lanze Liu <86434077+liulanze@users.noreply.github.com> Co-authored-by: Chengze Fan <fancz2002@gmail.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: ruizhang <rza21.bc@gmail.com> Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net> Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com> Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com> Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com> Co-authored-by: Furkan F <id+git@yufufi.com> Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com> Co-authored-by: Weida Hong <wdhongtw@google.com> Co-authored-by: mrjunwan-lang <mrjunwan@google.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Tobias Wasner <wasnertobias@users.noreply.github.com> Co-authored-by: Keyi Li <94494390+JasonKeyiL@users.noreply.github.com> Co-authored-by: Keyi Li <likey6688@gmail.com> Co-authored-by: Ilya Markov <markovilya197@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: sychen52 <41452870+sychen52@users.noreply.github.com> Co-authored-by: Zhanda Zhu <49645678+zhandaz@users.noreply.github.com> Co-authored-by: Shang Wang <shangw@nvidia.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com> Co-authored-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Co-authored-by: Juhi Mittal <39641197+juhi10071998@users.noreply.github.com> Co-authored-by: Itay Alroy <75032521+itayalroy@users.noreply.github.com> Co-authored-by: Duncan Moss <djm.moss@gmail.com> Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com> Co-authored-by: Andreas Karatzas <akaratza@amd.com> Co-authored-by: Ashwin Giridharan <ashwing@users.noreply.github.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com> Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com> Co-authored-by: Xiaogang Zhou <zhou16386@163.com> Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com> Co-authored-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com> Co-authored-by: GuangYaoZheng <popkart06@gmail.com> Co-authored-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: ffggs <314137448@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Hua Huang <huangh1994@outlook.com> Co-authored-by: Holegots <fuergaosi@gmail.com> Co-authored-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: danisereb <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> Co-authored-by: Rotem Shavitt <rshavitt@gmail.com> Co-authored-by: weizhoublue <45163302+weizhoublue@users.noreply.github.com> Co-authored-by: Nguyễn Thế Duy <dtnguyen@nvidia.com> Co-authored-by: Roy Wang <jasonailu87@gmail.com> Co-authored-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: Zhewen Li <zhewenli@meta.com> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: Huanyu Yang <20242081160@mail.dlut.edu.cn> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: zhao, zhenhui <zhenhui.zhao@intel.com> Co-authored-by: Sting Lin <sting.lin@cienet.com> Co-authored-by: Jie Fang <jief@nvidia.com> Co-authored-by: Hank_ <37239608+ILikeIneine@users.noreply.github.com> Co-authored-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Ethan Feng <ethan.fengch@gmail.com> Co-authored-by: Thibault Castells <38716394+ThibaultCastells@users.noreply.github.com> Co-authored-by: linzm1007 <96732179+linzm1007@users.noreply.github.com> Co-authored-by: Javier De Jesus <javier.dejesusj9@gmail.com> Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Angela Yi <yiangela7@gmail.com> Co-authored-by: aoshen02 <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Nico Holmberg <nico.holmberg@amd.com> Co-authored-by: zhangtao2-1 <478679312@qq.com> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com> Co-authored-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Chunyang Wen <chunyang.wen@gmail.com> Co-authored-by: jatseng-ai <jatseng@amd.com> Co-authored-by: Minh Vu <vuhoangminh97@gmail.com>
…ontinued) (vllm-project#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…ontinued) (vllm-project#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
…ontinued) (vllm-project#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
…ontinued) (vllm-project#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
This is a continuation of the PR #38783
cc @janeyx99
Purpose
Stacked on #38757, commits to review https://github.com/vllm-project/vllm/pull/38783/changes/deea6618c38afb4735b442c61e2697c273654292..8754a4250584115db08113e0889313c939d85eb6Note: some declarations are not deleted from csrc/ops.h despite being moved to csrc/libtorch_stable/ops.h. This is because the CPU build also uses these declarations. These are
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.Differences between this and PR #38783
Migration progress using the Audit Python extension torch-abi-audit:
Before update:
After update:
So 2 unstable extension were removed from _C.abi.so.